Call rebalance API once all pods are updated #625

radu-gheorghe · 2023-09-20T12:26:28Z

Timid attempt to fix #615

It works on my machine, it seems to fix the problem, but:

- I have no tests. Yet. I have a feeling this is the hard part 🙂
- I guess I'll need to update managed-updates.md a bit as well.
- There are a couple of TODOs in my patch where I'm not sure if things are right. Any feedback there would be great!
- Last but not least, with or without my patch, I always got a DOWN shard at the end of my rolling update. That's a separate issue, right? It's always reproducible in my environment, maybe because I'm testing with some docs in my collection and my IDE (running the Solr Operator) talks to Kubernetes via a port forward to the "common" service. That port forward sometimes fails because the underlying pod gets restarted, so I get a few "connection refused" errors until I put it back. Still, it should be something that the Operator handles gracefully, IMO.

Any feedback is welcome, I'd be glad to push this forward!

HoustonPutman

This is great! I have two edge cases that I think we need to account for:

1) Once all the pods are updated, the async rebalance command will be started. The user can then change the spec again, and updatedPodCount will go back to 0, because all pods will be out of date. At that point we will have forgotten about the rebalance command that we started. We need to make sure that we complete any async command before we start a new one, so I suggest that we split off the async command into a separate ClusterOp that we start after the restart is complete. That way, another restart will only begin once the rebalance is finished.
2) If I'm correct, we are rebalancing even if replicas weren't moved around during the rolling restart. In my opinion, we should only rebalance if replicas were moved around.
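
To make the two edge cases concrete, here is a rough Go sketch of the ordering being suggested. It is not the operator's code: the `Cluster` type, its `Running` and `Pending` fields, and the `Reconcile` and `Complete` methods are all hypothetical names invented for illustration. It only models the two rules: never start a new operation while one is running, and only queue a rebalance when the rolling update actually moved replicas.

```go
package main

import "fmt"

// ClusterOp names a cluster-wide operation. These identifiers are
// illustrative only and are not taken from the operator's source.
type ClusterOp string

const (
	NoOp            ClusterOp = ""
	RollingUpdate   ClusterOp = "RollingUpdate"
	BalanceReplicas ClusterOp = "BalanceReplicas"
)

// Cluster is a hypothetical, simplified view of the state a reconcile
// loop would track: at most one running op plus a queue of follow-ups.
type Cluster struct {
	Running ClusterOp
	Pending []ClusterOp
}

// Reconcile decides which operation should be running right now.
// A running op always wins; queued follow-ups come next; a new rolling
// update starts only when nothing else is running or queued.
func (c *Cluster) Reconcile(outOfDatePods int) ClusterOp {
	if c.Running != NoOp {
		return c.Running
	}
	if len(c.Pending) > 0 {
		c.Running, c.Pending = c.Pending[0], c.Pending[1:]
		return c.Running
	}
	if outOfDatePods > 0 {
		c.Running = RollingUpdate
	}
	return c.Running
}

// Complete marks the running op as finished. A rebalance is queued only
// after a rolling update that moved replicas around (edge case 2).
func (c *Cluster) Complete(replicasMoved bool) {
	if c.Running == RollingUpdate && replicasMoved {
		c.Pending = append(c.Pending, BalanceReplicas)
	}
	c.Running = NoOp
}

func main() {
	c := &Cluster{}
	fmt.Println(c.Reconcile(3)) // RollingUpdate
	c.Complete(true)            // replicas were moved during the restart

	// The spec changes again, so all 3 pods are out of date once more,
	// but the queued rebalance still runs before the next restart.
	fmt.Println(c.Reconcile(3)) // BalanceReplicas
	c.Complete(false)
	fmt.Println(c.Reconcile(3)) // RollingUpdate
}
```

With this shape, a second spec change cannot make the rebalance "forgotten": it sits in the queue and blocks the next restart until it completes.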

HoustonPutman

I think this is good to go @radu-gheorghe , give it a look!

radu-gheorghe · 2023-10-10T08:35:28Z

Looks good to me, too, thanks a lot @HoustonPutman!

I already had a look yesterday, too, and noticed you couldn't help yourself and fixed your 2) above 🙂

HoustonPutman · 2023-10-10T14:28:08Z

> I already had a look yesterday, too, and noticed you couldn't help yourself and fixed your 2) above 🙂

Haha I remembered that we had the lovely readinessConditions that told us if the cloud was ephemeral, so it was very easy to do it!

calling rebalance API once all pods are updated

c1324c8

brickpattern mentioned this pull request Sep 21, 2023

Create Collection fails after Upgrade #627

Closed

HoustonPutman requested changes Oct 2, 2023

HoustonPutman added 8 commits October 6, 2023 11:57

Make a BalanceReplicas clusterOp

ca31cee

Merge remote-tracking branch 'apache/main' into pr/625

cdc0e6b

Fix scaling tests to find annotations quicker

c54bd76

Cleanup operations before they are queued

c709823

Set a retryDuration for balanceReplicas

ee057cf

Some more test fixes

f64cc02

Only rebalance for updates with data migration

513509b

Add a changelog entry, update docs

90e8063

HoustonPutman approved these changes Oct 9, 2023

Merge branch 'main' into rebalance_after_rolling_update

ccb5bbf

HoustonPutman merged commit 0ce9517 into apache:main Oct 10, 2023
3 checks passed
